
    Improving average ranking precision in user searches for biomedical research datasets

    Availability of research datasets is a keystone for health and life science study reproducibility and scientific progress. Owing to the heterogeneity and complexity of these data, a main challenge for research data management systems is to provide users with the best answers to their search queries. In the context of the 2016 bioCADDIE Dataset Retrieval Challenge, we investigate a novel ranking pipeline to improve the search for datasets used in biomedical experiments. Our system comprises a query expansion model based on word embeddings, a similarity measure algorithm that takes into consideration the relevance of the query terms, and a dataset categorisation method that boosts the rank of datasets matching query constraints. The system was evaluated using a corpus of 800k datasets and 21 annotated user queries, and it provides competitive results compared to the other challenge participants. In the official run, it achieved the highest infAP among the participants, +22.3% higher than the median infAP of the participants' best submissions. Overall, it ranks in the top 2 when an aggregated metric using the best official measures per participant is considered. The query expansion method had a positive impact on the system's performance, improving on our baseline by up to +5.0% and +3.4% for the infAP and infNDCG metrics, respectively. Our similarity measure algorithm appears robust, showing smaller performance variations under different training conditions than the Divergence From Randomness framework in particular. Finally, the result categorisation did not have a significant impact on the system's performance. We believe that our solution could be used to enhance biomedical dataset management systems. In particular, data-driven query expansion methods could be an alternative to the complexity of biomedical terminologies.
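    The abstract names the technique but not the implementation; purely as a hedged illustration, here is a minimal sketch of word-embedding query expansion with gensim, assuming a hypothetical pre-trained biomedical word2vec file (biomedical_w2v.bin) and an expansion size of three:

        # Minimal sketch of embedding-based query expansion (illustrative;
        # the model path and expansion size are assumptions, not from the paper).
        from gensim.models import KeyedVectors

        # Hypothetical pre-trained biomedical word2vec model.
        model = KeyedVectors.load_word2vec_format("biomedical_w2v.bin", binary=True)

        def expand_query(terms, topn=3):
            """Append the topn nearest-neighbour terms for each query word."""
            expanded = list(terms)
            for term in terms:
                if term in model.key_to_index:
                    expanded += [w for w, _ in model.most_similar(term, topn=topn)]
            return expanded

        print(expand_query(["leukemia", "expression"]))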

    Design of an Integrated Analytics Platform for Healthcare Assessment Centered on the Episode of Care

    Assessing care quality and performance is essential to improving healthcare processes and population health management. However, due to poor system design and lack of access to the required data, this assessment is often delayed or not done at all. The goal of our research is to investigate an advanced analytics platform that enables healthcare quality and performance assessment. We used a user-centered design approach to identify the system requirements, with the concept of the episode of care as the building block of information for a key performance indicator analytics system. We implemented architecture and interface prototypes and performed a usability test with hospital users in managerial roles. The results show that, by using user-centered design, we created an analytics platform that provides a holistic and integrated view of the clinical, financial, and operational aspects of the institution. Our encouraging results warrant further studies to understand other aspects of usability.

    DS4DH at #SMM4H 2023: Zero-Shot Adverse Drug Events Normalization using Sentence Transformers and Reciprocal-Rank Fusion

    This paper outlines the performance evaluation of a system for adverse drug event normalization, developed by the Data Science for Digital Health group for the Social Media Mining for Health Applications 2023 shared task 5. Shared task 5 targeted the normalization of adverse drug event mentions on Twitter to standard concepts from the Medical Dictionary for Regulatory Activities terminology. Our system hinges on a two-stage approach: BERT fine-tuning for entity recognition, followed by zero-shot normalization using sentence transformers and reciprocal-rank fusion. The approach yielded a precision of 44.9%, a recall of 40.5%, and an F1-score of 42.6%. It outperformed the median performance in shared task 5 by 10% and demonstrated the highest performance among all participants. These results substantiate the effectiveness of our approach and its potential application to adverse drug event normalization in the realm of social media text mining.
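    Reciprocal-rank fusion itself is well defined: each candidate receives the score sum over rankings of 1/(k + rank). As a hedged sketch only, here is zero-shot normalization by sentence-transformer similarity fused with RRF over two off-the-shelf models; the toy concept list, the model names, and k=60 are assumptions, not details from the paper:

        # Illustrative sketch: rank concept candidates for a mention with two
        # sentence-transformer models, then merge rankings with reciprocal-rank fusion.
        from sentence_transformers import SentenceTransformer, util

        concepts = ["nausea", "headache", "somnolence"]  # toy concept inventory
        mention = "felt really sleepy all day"

        def ranking(model_name):
            model = SentenceTransformer(model_name)
            scores = util.cos_sim(model.encode(mention), model.encode(concepts))[0]
            # Concepts ordered from most to least similar.
            return [c for _, c in sorted(zip(scores.tolist(), concepts), reverse=True)]

        def rrf(rankings, k=60):
            """Reciprocal-rank fusion: score(c) = sum_i 1 / (k + rank_i(c))."""
            fused = {}
            for r in rankings:
                for rank, c in enumerate(r, start=1):
                    fused[c] = fused.get(c, 0.0) + 1.0 / (k + rank)
            return sorted(fused, key=fused.get, reverse=True)

        runs = [ranking(m) for m in ("all-MiniLM-L6-v2", "all-mpnet-base-v2")]
        print(rrf(runs))  # fused concept ranking for the mention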

    Modélisation et maquettage d’une interface de gestion des métadonnées en bibliothèque

    The conceptual model of bibliographic metadata currently most widespread in libraries dates back to the creation of computerized cataloguing in the 1960s. It was designed to describe physical resources and no longer matches the needs and objectives of today's public catalogue interfaces and platforms. Two competing models are currently being adopted in libraries: Bibframe and LRM. These models, better suited to the web's linked data ecosystem, facilitate metadata exchange and aim to open up the resources contained in library catalogues. Both propose, with varying granularity, conceptual structures of three or four descriptive levels, and both build on newer standards from the web and from library science, such as RDF and RDA. This work evaluates, through exploratory qualitative research, how well these two models and the current model meet the emerging needs of cataloguing librarians in the four types of target institutions of the Réseau des bibliothèques de Suisse occidentale (RERO). It also develops, on the basis of several use cases, process models in order to prototype the new cataloguing module interface of the integrated library system under development at RERO. To this end, we carried out semi-structured interviews and cataloguing observations with six librarians specialized in bibliographic metadata, a multi-criteria comparative analysis, the creation of use cases, and UML modelling of cataloguing processes. The result of this work consists of two distinct parts. The first is the qualitative analysis derived from the interviews. This analysis highlighted a need, shared by almost all the librarians interviewed, for flexibility in cataloguing rules, as well as a fear that the adoption of new standards would make cataloguing more complex. However, this analysis did not allow us to justify the choice of a specific conceptual model, which is why the second part of our results, the mock-up of the cataloguing interface, is based on a "generic" model with three descriptive levels inspired by the Bibframe model. It is now important for RERO to move to a conceptual model that adds value to its bibliographic metadata, makes the information contained in its member libraries' catalogues more visible in the web's linked data ecosystem, and does not burden the cataloguing processes.

    Named entity recognition in chemical patents using ensemble of contextual language models

    Chemical patent documents describe a broad range of applications and hold key reaction and compound information, such as chemical structures, reaction formulas, and molecular properties. These informational entities must first be identified in text passages before they can be utilized in downstream tasks. Text mining provides means to extract relevant information from chemical patents through information extraction techniques. As part of the Information Extraction task of the Cheminformatics Elsevier Melbourne University challenge, in this work we study the effectiveness of contextualized language models for extracting reaction information from chemical patents. We assess transformer architectures trained on generic and specialised corpora to propose a new ensemble model. Our best model, based on a majority ensemble approach, achieves an exact F1-score of 92.30% and a relaxed F1-score of 96.24%. The results show that an ensemble of contextualized language models provides an effective method to extract information from chemical patents.
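    The voting scheme is not detailed in the abstract; as an illustration only, a minimal sketch of a majority ensemble over per-token BIO labels from several taggers (the label set and the tie-breaking rule are assumptions):

        # Illustrative majority-vote ensemble over per-token BIO predictions.
        from collections import Counter

        def majority_vote(predictions):
            """predictions: list of label sequences, one per model, same length.
            Returns the per-token majority label (ties broken by first seen)."""
            return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]

        # Three hypothetical taggers labelling the same five tokens.
        model_outputs = [
            ["B-REACTION", "I-REACTION", "O", "B-COMPOUND", "O"],
            ["B-REACTION", "O",          "O", "B-COMPOUND", "O"],
            ["B-REACTION", "I-REACTION", "O", "O",          "O"],
        ]
        print(majority_vote(model_outputs))
        # ['B-REACTION', 'I-REACTION', 'O', 'B-COMPOUND', 'O']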

    Detection of Patients at Risk of Multidrug-Resistant Enterobacteriaceae Infection Using Graph Neural Networks: A Retrospective Study

    Funding: This research was funded by the Joint Swiss-Portuguese Academic Program from the University of Applied Sciences and Arts Western Switzerland (HES-SO) and the Fundação para a Ciência e Tecnologia (FCT). S.G.P. also acknowledges FCT for her direct funding (CEECINST/00051/2018) and her research unit (UIDB/05704/2020). Funders were not involved in the study design, data pre-processing, data analysis, interpretation, or report writing. Author contributions: R.G. and A.B. designed and implemented the models and ran the experiments and analyses. R.G. and D.T. wrote the manuscript draft. D.T. and S.G.P. conceptualized the experiments and acquired funding. R.G., D.P., and S.G.P. curated the data. R.G., A.B., D.P., and D.T. analyzed the data. All authors reviewed and approved the manuscript. Competing interests: The authors declare that they have no competing interests.

    Background: While Enterobacteriaceae bacteria are commonly found in the healthy human gut, their colonization of other body parts can evolve into serious infections and health threats. We investigate a graph-based machine learning model to predict the risk of inpatient colonization by multidrug-resistant (MDR) Enterobacteriaceae. Methods: Colonization prediction was defined as a binary task, where the goal is to predict whether a patient is colonized by MDR Enterobacteriaceae in an undesirable body part during their hospital stay. To capture topological features, interactions among patients and healthcare workers were modeled using a graph structure, where patients are described by nodes and their interactions by edges. A graph neural network (GNN) model was then trained to learn colonization patterns from the patient network enriched with clinical and spatiotemporal features. Results: The GNN model achieves an area under the receiver operating characteristic curve (AUROC) between 0.91 and 0.96 when trained in inductive and transductive settings, respectively, up to 8% above a logistic regression baseline (0.88). Comparing network topologies, the configuration considering ward-related edges (0.91 inductive, 0.96 transductive) outperforms the configurations considering caregiver-related edges (0.88, 0.89) and both types of edges (0.90, 0.94). For the top 3 most prevalent MDR Enterobacteriaceae, the AUROC varies from 0.94 for Citrobacter freundii up to 0.98 for Enterobacter cloacae using the best-performing GNN model. Conclusion: Topological features via graph modeling improve the performance of machine learning models for Enterobacteriaceae colonization prediction. GNNs could be used to support infection prevention and control programs to detect patients at risk of colonization by MDR Enterobacteriaceae and other bacteria families.
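    The abstract names the model family but shows no code; purely as a hedged sketch, here is a minimal GNN for binary node classification (colonized vs. not) on a toy patient-interaction graph with PyTorch Geometric, where the layer sizes, feature dimension, and graph itself are assumptions:

        # Illustrative sketch: binary node classification on a small
        # interaction graph, using PyTorch Geometric.
        import torch
        import torch.nn.functional as F
        from torch_geometric.data import Data
        from torch_geometric.nn import GCNConv

        # Toy graph: 4 patients, 8 clinical/spatiotemporal features each,
        # undirected edges given in both directions.
        x = torch.randn(4, 8)
        edge_index = torch.tensor([[0, 1, 1, 2, 2, 3], [1, 0, 2, 1, 3, 2]])
        y = torch.tensor([0, 1, 0, 1])  # hypothetical colonization labels
        data = Data(x=x, edge_index=edge_index, y=y)

        class GNN(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.conv1 = GCNConv(8, 16)
                self.conv2 = GCNConv(16, 2)  # two classes: colonized / not

            def forward(self, data):
                h = F.relu(self.conv1(data.x, data.edge_index))
                return self.conv2(h, data.edge_index)

        model = GNN()
        optim = torch.optim.Adam(model.parameters(), lr=0.01)
        for _ in range(100):  # plain full-batch training loop
            optim.zero_grad()
            loss = F.cross_entropy(model(data), data.y)
            loss.backward()
            optim.step()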

    Extraction des concepts biomédicaux des essais cliniques en utilisant le traitement automatique du langage naturel

    Clinical trials are scientific studies that evaluate the efficacy of certain medicines, drugs, or new medical methods, as well as their side effects. Most of the time, they end in failure, so having a tool that can assess the risk of failure is crucial. Clinical trial protocols are written in free text, which makes standard automatic processing by computer nearly impossible; this is why natural language processing is used. The goal of this work is to create a database containing clinical trials and the concepts that can be extracted from them, to enable automatic processing in the future.

    Text mining processing pipeline for semi-structured data (D3.3)

    Unstructured and semi-structured cohort data contain relevant information about the health condition of a patient, e.g., free text describing disease diagnoses, drugs, and medication reasons, which is often not available in structured formats. One of the challenges posed by medical free text is that there can be several ways of mentioning a concept. Encoding free text into unambiguous descriptors therefore allows us to leverage the value of the cohort data, in particular by facilitating its findability and interoperability across cohorts in the project. Named entity recognition and normalization enable the automatic conversion of free text into standard medical concepts. Given the volume of available data shared in the CINECA project, the WP3 text mining working group has developed named entity normalization techniques to obtain standard concepts from the unstructured and semi-structured fields available in the cohorts. In this deliverable, we present the methodology used to develop the different text mining tools created by the dedicated SFU, UMCG, EBI, and HES-SO/SIB groups for specific CINECA cohorts.
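    The deliverable's actual tools are not reproduced here; as an illustration only, here is a minimal sketch of one simple normalization strategy, fuzzy dictionary matching of free-text mentions to standard concept identifiers, where the toy terminology and the threshold are assumptions:

        # Illustrative sketch: normalize free-text mentions to standard concept IDs
        # by fuzzy matching against a small terminology, using difflib.
        from difflib import SequenceMatcher

        # Hypothetical terminology: surface form -> concept identifier.
        terminology = {
            "type 2 diabetes mellitus": "C0011860",
            "hypertension": "C0020538",
            "myocardial infarction": "C0027051",
        }

        def normalize(mention, threshold=0.75):
            """Return the best-matching concept ID, or None below the threshold."""
            best_id, best_score = None, 0.0
            for surface, concept_id in terminology.items():
                score = SequenceMatcher(None, mention.lower(), surface).ratio()
                if score > best_score:
                    best_id, best_score = concept_id, score
            return best_id if best_score >= threshold else None

        print(normalize("Type II diabetes mellitus"))  # -> 'C0011860'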